256 research outputs found

    A Fine-grained Performance Model for GPU Architectures

    Get PDF
    The increasing programmability, performance, and cost/effectiveness of GPUs have led to a widespread use of such many-core architectures to accelerate general purpose applications. Nevertheless, tuning applications to efficiently exploit the GPU potentiality is a very challenging task, especially for inexperienced programmers. This is due to the difficulty of developing a SW application for the specific GPU architectural configuration, which includes managing the memory hierarchy and optimizing the execution of thousands of concurrent threads while maintaining the semantic correctness of the application. Even though several profiling tools exist, which provide programmerswith a large number of metrics and measurements, it is often difficult to interpret such information for effectively tuning the application. This paper presents a performance model that allows accurately estimating the potential performance of the application under tuning on a given GPU device and, at the same time, it provides programmers with interpretable profiling hints. The paper shows the results obtained by applying the proposedmodel for profiling commonly used primitives and real codes

    An Enhanced Profiling Framework for the Analysis and Development of Parallel Primitives for GPUs

    Get PDF
    Parallelizing software applications through the use of existing optimized primitives is a common trend that mediates the complexity of manual parallelization and the use of less efficient directive-based programming models. Parallel primitive libraries allow software engineers to map any sequential code to a target many-core architecture by identifying the most computational intensive code sections and mapping them into one ore more existing primitives. On the other hand, the spreading of such a primitive-based programming model and the different GPU architectures have led to a large and increasing number of thirdparty libraries, which often provide different implementations of the same primitive, each one optimized for a specific architecture. From the developer point of view, this moves the actual problem of parallelizing the software application to selecting, among the several implementations, the most efficient primitives for the target platform. This paper presents a profiling framework for GPU primitives, which allows measuring the implementation quality of a given primitive by considering the target architecture characteristics. The framework collects the information provided by a standard GPU profiler and combines them into optimization criteria. The criteria evaluations are weighed to distinguish the impact of each optimization on the overall quality of the primitive implementation. The paper shows how the tuning of the different weights has been conducted through the analysis of five of the most widespread existing primitive libraries and how the framework has been eventually applied to improve the implementation performance of a standard primitive

    Power-aware Performance Tuning of GPU Applications Through Microbenchmarking

    Get PDF
    Tuning GPU applications is a very challenging task as any source-code optimization can sensibly impact performance, power, and energy consumption of the GPU device. Such an impact also depends on the GPU on which the application is run. This paper presents a suite of microbenchmarks that provides the actual characteristics of specific GPU device components (e.g., arithmetic instruction units, memories, etc.) in terms of throughput, power, and energy consumption. It shows how the suite can be combined to standard profiler information to efficiently drive the application tuning by considering the three design constraints (power, performance, energy consumption) and the characteristics of the target GPU device

    Pro++: A Profiling Framework for Primitive-based GPU Programming

    Get PDF
    Parallelizing software applications through the use of existing optimized primitives is a common trend that mediates the complexity of manual parallelization and the use of less efficient directive-based programming models. Parallel primitive libraries allow software engineers to map any sequential code to a target many-core architecture by identifying the most computational intensive code sections and mapping them into one ore more existing primitives. On the other hand, the spreading of such a primitive-based programming model and the different GPU architectures have led to a large and increasing number of third-party libraries, which often provide different implementations of the same primitive, each one optimized for a specific architecture. From the developer point of view, this moves the actual problem of parallelizing the software application to selecting, among the several implementations, the most efficient primitives for the target platform. This paper presents Pro++, a profiling framework for GPU primitives that allows measuring the implementation quality of a given primitive by considering the target architecture characteristics. The framework collects the information provided by a standard GPU profiler and combines them into optimization criteria. The criteria evaluations are weighed to distinguish the impact of each optimization on the overall quality of the primitive implementation. The paper shows how the tuning of the different weights has been conducted through the analysis of five of the most widespread existing primitive libraries and how the framework has been eventually applied to improve the implementation performance of two standard and widespread primitives

    Testbench qualification of SystemC TLM protocols through Mutation Analysis

    Get PDF
    Transaction-level modeling (TLM) has become the de-facto reference modeling style for system-level design and verification of embedded systems. It allows designers to implement high-level communication protocols for simulations up to 1000x faster than at register-transfer level (RTL). To guarantee interoperability between TLM IP suppliers and users, designers implement the TLM communication protocols by relying on a reference standard, such as the standard OSCI for SystemC TLM. Functional correctness of such protocols as well as their compliance to the reference TLM standard are usually verified through user-defined testbenches, which high-quality and completeness play a key role for an efficient TLM design and verification flow. This article presents a methodology to apply mutation analysis, a technique applied in literature for SW testing, for measuring the testbench quality in verifying TLM protocols. In particular, the methodology aims at (i) qualifying the testbenches by considering both the TLM protocol correctness and their compliance to a defined standard (i.e., OSCI TLM), (ii) optimizing the simulation time during mutation analysis by avoiding mutation redundancies, and (iii) driving the designers in the testbench improvement. Experimental results on benchmarks of different complexity and architectural characteristics are reported to analyze the methodology applicability

    MIPP: A Microbenchmark Suite for Performance, Power, and Energy Consumption Characterization of GPU architectures

    Get PDF
    GPU-accelerated applications are becoming increasingly common in high-performance computing as well as in low-power heterogeneous embedded systems. Nevertheless, GPU programming is a challenging task, especially if a GPU application has to be tuned to fully take advantage of the GPU architectural configuration. Even more challenging is the application tuning by considering power and energy consumption, which have emerged as first-order design constraints in additionto performance. Solving bottlenecks of a GPU application such as high thread divergence or poor memory coalescing have a different impact on the overall performance, power and energy consumption. Such an impact also depends on the GPU device on which the application is run. This paper presents a suite of microbenchmarks, which are specialized chunks of GPU code that exercise specific device components (e.g., arithmetic instruction units, shared memory, cache, DRAM, etc.) and that provide the actual characteristics of such components in terms of throughput, power, and energy consumption. The suite aims at enriching standard profiler information and guiding the GPU application tuning on a specific GPU architecture by considering all three design constraints (i.e., power, performance, energy consumption). The paper presents the results obtained by applying the proposed suite to characterize two different GPU devices and to understand how application tuning may impact differently on them

    Test Generation Based on CLP

    Get PDF
    Functional ATPGs based on simulation are fast, but generally, they are unable to cover corner cases, and they cannot prove untestability. On the contrary, functional ATPGs exploiting formal methods, being exhaustive, cover corner cases, but they tend to suffer of the state explosion problem when adopted for verifying large designs. In this context, we have defined a functional ATPG that relies on the joint use of pseudo-deterministic simulation and Constraint Logic Programming (CLP), to generate high-quality test sequences for solving complex problems. Thus, the advantages of both simulation-based and static-based verification techniques are preserved, while their respective drawbacks are limited. In particular, CLP, a form of constraint programming in which logic programming is extended to include concepts from constraint satisfaction, is well-suited to be jointly used with simulation. In fact, information learned during design exploration by simulation can be effectively exploited for guiding the search of a CLP solver towards DUV areas not covered yet. The test generation procedure relies on constraint logic programming (CLP) techniques in different phases of the test generation procedure. The ATPG framework is composed of three functional ATPG engines working on three different models of the same DUV: the hardware description language (HDL) model of the DUV, a set of concurrent EFSMs extracted from the HDL description, and a set of logic constraints modeling the EFSMs. The EFSM paradigm has been selected since it allows a compact representation of the DUV state space that limits the state explosion problem typical of more traditional FSMs. The first engine is randombased, the second is transition-oriented, while the last is fault-oriented. The test generation is guided by means of transition coverage and fault coverage. In particular, 100% transition coverage is desired as a necessary condition for fault detection, while the bit coverage functional fault model is used to evaluate the effectiveness of the generated test patterns by measuring the related fault coverage. A random engine is first used to explore the DUV state space by performing a simulation-based random walk. This allows us to quickly fire easy-to-traverse (ETT) transitions and, consequently, to quickly cover easy-to-detect (ETD) faults. However, the majority of hard-to-traverse (HTT) transitions remain, generally, uncovered. Thus, a transition-oriented engine is applied to cover the remaining HTT transitions by exploiting a learning/backjumping-based strategy. The ATPG works on a special kind of EFSM, called SSEFSM, whose transitions present the most uniformly distributed probability of being activated and can be effectively integrated to CLP, since it allows the ATPG to invoke the constraint solver when moving between EFSM states. A constraint logic programming-based (CLP) strategy is adopted to deterministically generate test vectors that satisfy the guard of the EFSM transitions selected to be traversed. Given a transition of the SSEFSM, the solver is required to generate opportune values for PIs that enable the SSEFSM to move across such a transition. Moreover, backjumping, also known as nonchronological backtracking, is a special kind of backtracking strategy which rollbacks from an unsuccessful situation directly to the cause of the failure. Thus, the transition-oriented engine deterministically backjumps to the source of failure when a transition, whose guard depends on previously set registers, cannot be traversed. Next it modifies the EFSM configuration to satisfy the condition on registers and successfully comes back to the target state to activate the transition. The transition-oriented engine generally allows us to achieve 100% transition coverage. However, 100% transition coverage does not guarantee to explore all DUV corner cases, thus some hard-to-detect (HTD) faults can escape detection preventing the achievement of 100% fault coverage. Therefore, the CLP-based fault-oriented engine is finally applied to focus on the remaining HTD faults. The CLP solver is used to deterministically search for sequences that propagate the HTD faults observed, but not detected, by the random and the transition-oriented engine. The fault-oriented engine needs a CLP-based representation of the DUV, and some searching functions to generate test sequences. The CLP-based representation is automatically derived from the S2EFSM models according to the defined rules, which follow the syntax of the ECLiPSe CLP solver. This is not a trivial task, since modeling the evolution in time of an EFSM by using logic constraints is really different with respect to model the same behavior by means of a traditional HW description language. At first, the concept of time steps is introduced, required to model the SSEFSM evolution through the time via CLP. Then, this study deals with modeling of logical variables and constraints to represent enabling functions and update functions of the SSEFSM. Formal tools that exhaustively search for a solution frequently run out of resources when the state space to be analyzed is too large. The same happens for the CLP solver, when it is asked to find a propagation sequence on large sequential designs. Therefore we have defined a set of strategies that allow to prune the search space and to manage the complexity problem for the solver

    A SystemC Platform for Signal Transduction Modelling and Simulation in Systems Biology

    Get PDF
    Signal transduction is a class of cell\u2019s biological processes,which are commonly represented as highly concurrent reactive systems. In the Systems Biology community, modelling and simulation of signal transduction require overcoming issues like discrete event-based execution of complex systems, description from building blocks through composition and encapsulation, description at different levels of granularity, methods for abstraction and refinement. This paper presents a signal transduction modelling and simulationplatform based on SystemC, and shows how the platform allows handling the system complexity by modelling it at different abstraction levels. The paper reports the results obtained by applying the platform to model the intracellular signalling network controlling integrin activation mediating leukocyte recruitment from the blood into the tissues. The dynamic simulation of the model has been conducted with the aim of exploring oscillating behaviors of such a biochemical circuit and, more in general, to help better understanding properties of the overall dynamics of leukocyte recruitment

    Analog Defect Injection and Fault Simulation Techniques: A Systematic Literature Review

    Get PDF
    Since the last century, the exponential growth of the semiconductor industry has led to the creation of tiny and complex integrated circuits, e.g., sensors, actuators, and smart power. Innovative techniques are needed to ensure the correct functionality of analog devices that are ubiquitous in every smart system. The ISO 26262 standard for functional safety in the automotive context specifies that fault injection is necessary to validate all electronic devices. For decades, standardization of defect modeling and injection mainly focused on digital circuits and, in a minor part, on analog ones. An initial attempt is being made with the IEEE P2427 draft standard that started to give a structured and formal organization to the analog testing field. Various methods have been proposed in the literature to speed up the fault simulation of the defect universe for an analog circuit. A more limited number of papers seek to reduce the overall simulation time by reducing the number of defects to be simulated. This literature survey describes the state-of-the-art of analog defect injection and fault simulation methods. The survey is based on the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) methodological flow, allowing for a systematic and complete literature survey. Each selected paper has been categorized and presented to provide an overview of all the available approaches. In addition, the limitations of the various approaches are discussed by showing possible future directions
    • …
    corecore